fix: generate scrambled benchmark data with correct per-file RG ranges#21711

Open
zhuqi-lucas wants to merge 1 commit into apache:main from zhuqi-lucas:fix/overlap-benchmark-data
Conversation

@zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Apr 18, 2026

Which issue does this PR close?

Related to #21580

Rationale for this change

Both the inexact and overlap benchmark data generation had the same fundamental problem: writing a single parquet file with ORDER BY scramble/jitter causes the parquet writer to merge rows from adjacent chunks into the same RG at chunk boundaries. This widens RG ranges to span the full data range (~6M instead of ~100K), making reorder_by_statistics a no-op — there's nothing meaningful to reorder.

For the overlap benchmark specifically, the jitter formula also preserved ascending RG order by min values, so there was nothing to reorder even without the boundary-merging issue.

What changes are included in this PR?

Replace the single-file ORDER BY approach for both data_sort_pushdown_inexact() datasets with a two-step strategy:

  1. Write a single sorted file with small (100K-row) RGs via DataFusion, producing ~61 RGs with narrow, non-overlapping l_orderkey ranges.
  2. Split into per-RG files using pyarrow and rename with a deterministic permutation so alphabetical file order ≠ l_orderkey order.

Each file has a narrow range (~100K) but appears in scrambled order — reorder_by_statistics has real work to do.

Local benchmark results (reorder feature branch vs upstream/main):

| Benchmark | Query                    | Baseline | With reorder | Change |
|-----------|--------------------------|----------|--------------|--------|
| inexact   | Q1 (DESC LIMIT 100)      | 4.99 ms  | 4.27 ms      | -14%   |
| inexact   | Q2 (DESC LIMIT 1000)     | 2.54 ms  | 2.26 ms      | -11%   |
| inexact   | Q3 (SELECT * LIMIT 100)  | 6.10 ms  | 5.92 ms      | ~0%    |
| inexact   | Q4 (SELECT * LIMIT 1000) | 4.66 ms  | 5.03 ms      | ~0%    |
| overlap   | Q1 (DESC LIMIT 100)      | 5.03 ms  | 3.97 ms      | -21%   |
| overlap   | Q2 (DESC LIMIT 1000)     | 2.17 ms  | 2.23 ms      | ~0%    |
| overlap   | Q3 (SELECT * LIMIT 100)  | 6.20 ms  | 6.09 ms      | ~0%    |
| overlap   | Q4 (SELECT * LIMIT 1000) | 5.25 ms  | 4.90 ms      | -7%    |

Are these changes tested?

Benchmark data generation change only. Verified locally by:

  • Inspecting parquet metadata: each file has narrow ranges (~100K) in scrambled order
  • Running benchmark comparisons showing clear improvement with reorder_by_statistics

Are there any user-facing changes?

No. Adds pyarrow as a dependency for generating these specific benchmark datasets (pip install pyarrow).

Contributor

Copilot AI left a comment


Pull request overview

Updates the sort pushdown Inexact overlap benchmark data generator so it actually produces parquet row groups that are both (a) overlapping and (b) out of order, making reorder_by_statistics do meaningful work (matching the streaming / delayed-chunk scenario described in #21580 and its follow-ups).

Changes:

  • Changes overlap dataset generation to order by a deterministic chunk permutation key plus a jittered l_orderkey, producing scrambled + overlapping RGs.
  • Updates benchmark script messaging/comments to reflect the new generation strategy.


@zhuqi-lucas zhuqi-lucas force-pushed the fix/overlap-benchmark-data branch from ec93c3b to 8fb395a Compare April 18, 2026 06:33
@zhuqi-lucas zhuqi-lucas changed the title fix: generate scrambled+overlapping RGs for overlap benchmark fix: generate scrambled benchmark data with correct per-file RG ranges Apr 18, 2026
@zhuqi-lucas
Contributor Author

Addressed both Copilot comments; the ORDER BY + jitter/scramble approach was fundamentally flawed:

  1. chunk_id overflow: moot now, no more chunk_id computation in SQL.
  2. Comment wording about row-count chunks: also moot.

The root cause was that the parquet writer merges rows from adjacent chunks into the same RG at chunk boundaries, widening RG ranges to ~3M (instead of ~100K). This made reorder_by_statistics ineffective.

New approach: write a single sorted file with 100K-row RGs, then use pyarrow to split into per-RG files with scrambled names. Each file has a narrow range, scrambled order → reorder actually works. Local benchmark shows ~21% improvement on Q1 with the corrected data.

Both the inexact and overlap benchmark data generation had the same
fundamental problem: writing a single parquet file with ORDER BY
scramble/jitter causes the parquet writer to merge rows from adjacent
chunks into the same RG at chunk boundaries, widening RG ranges to
span the full data range (~6M instead of ~100K). This makes
reorder_by_statistics a no-op — there's nothing meaningful to reorder.

Fix both by using a two-step approach:
1. Write a single sorted file with small (100K-row) RGs
2. Use pyarrow to split into per-RG files with scrambled names

Each file has a narrow l_orderkey range (~100K) but appears in
scrambled alphabetical order, so reorder_by_statistics has real work
to do.

Local benchmark results (feature branch vs upstream/main):

inexact:
  Q1 (DESC LIMIT 100):    4.99ms -> 4.27ms (-14%)
  Q2 (DESC LIMIT 1000):   2.54ms -> 2.26ms (-11%)

overlap:
  Q1 (DESC LIMIT 100):    5.03ms -> 3.97ms (-21%)
  Q4 (SELECT * LIMIT 1000): 5.25ms -> 4.90ms (-7%)
@zhuqi-lucas zhuqi-lucas force-pushed the fix/overlap-benchmark-data branch from 8fb395a to 6aa52df Compare April 18, 2026 06:55
